STAT 301 Group Project Final Report¶

Authors: Ellie, Yuxi, Leen, Macy

Introduction¶

A GitHub repository is an efficient tool for code management and collaboration. Whether for personal learning, team development, or open-source projects, it is highly effective for both users and creators. In recent years, with the widespread adoption of the internet and the rise of the big data industry, the demand for repositories on GitHub has grown significantly. Understanding users' needs for repositories with different characteristics can help creators better align with user expectations and grasp industry trends, ultimately contributing to the long-term and sustainable development of the platform. Within this context, one measure of a repository's success and popularity is how many stars it accumulates on github. With the current domination of github in the tech space, developing methods to predict or increase a repository's popularity with users can be essential. This brings us to the following research questions.


Research questions:

  • Which fundamental characteristics of a repository influence its popularity?
  • Can these fundamental characteristics effectively predict the popularity of a repository?


Alignment with Existing Literature:

Previous research by Hudson Borges, Andre Hora, and Marco Tulio Valente (2016) in "Predicting the Popularity of GitHub Repositories" utilized multiple linear regression to analyze the factors influencing repository popularity. Similarly, in "Characterization and Prediction of Popular Projects on GitHub," Junxiao Han, Shuiguang Deng, Xin Xia, Dongjing Wang, and Jianwei Yin (2019) applied multiple linear regression to examine a different dataset. Despite the variation in data sources, both studies arrived at strikingly similar conclusions: the number of forks exhibits a strong positive correlation with the number of stars, establishing it as a significant predictor of repository popularity. In contrast, variables such as license type and repository creation time were found to have relatively minor impacts, underscoring the limited influence of these factors on the popularity of GitHub repositories. In our report, we utilized a different dataset, explored various input variables, and employed alternative model selection methods to investigate whether other variables could influence the popularity of repositories. In this way, the repository creators can get more comprehensive and accurate guidance to improve their repository’s popularity.


Dataset:

To address our research questions, this study will utilize data from Kaggle. This data was collected through GitHub search API, and contains information on the top 215,000 Github repositories constrained to the repositories with over 167 stars. It includes the following 24 variables:

In [1]:
# Main developer: YUXI
variable_description <- data.frame(
  Field = c("Name", "Description", "URL", "Created.At", "Updated.At", "Homepage", 
            "Size", "Stars", "Forks", "Issues", "Watchers", "Language", "License", 
            "Topics", "Has.Issues", "Has.Projects", "Has.Downloads", "Has.Wiki", 
            "Has.Pages", "Has.Discussions", "Is.Fork", "Is.Archived", "Is.Template", 
            "Default.Branch"),
  Description = c("The name of the GitHub repository",
                  "A brief textual description that summarizes the purpose or focus of the repository",
                  "The URL or web address that links to the GitHub repository",
                  "The date and time when the repository was initially created on GitHub",
                  "The date and time of the most recent update or modification to the repository",
                  "The URL to the homepage or landing page associated with the repository",
                  "The size of the repository in bytes, indicating the total storage space used by the repository's files and data",
                  "The number of stars or likes that the repository has received from other GitHub users, indicating its popularity or interest",
                  "The number of times the repository has been forked by other GitHub users",
                  "The total number of open issues",
                  "The number of GitHub users who are 'watching' or monitoring the repository for updates and changes",
                  "The primary programming language",
                  "Information about the software license using a license identifier",
                  "A list of topics or tags associated with the repository, helping users discover related projects and topics of interest",
                  "A boolean value indicating whether the repository has an issue tracker enabled",
                  "A boolean value indicating whether the repository uses GitHub Projects to manage and organize tasks and work items",
                  "A boolean value indicating whether the repository offers downloadable files or assets to users",
                  "A boolean value indicating whether the repository has an associated wiki with additional documentation and information",
                  "A boolean value indicating whether the repository has GitHub Pages enabled, allowing the creation of a website associated with the repository",
                  "A boolean value indicating whether the repository has GitHub Discussions enabled, allowing community discussions and collaboration",
                  "A boolean value indicating whether the repository is a fork of another repository",
                  "A boolean value indicating whether the repository is archived. Archived repositories are typically read-only and are no longer actively maintained",
                  "A boolean value indicating whether the repository is set up as a template",
                  "The name of the default branch"),
  stringsAsFactors = FALSE
)
cat("Table 1: Description of the variables in our dataset \n")
variable_description
Table 1: Description of the variables in our dataset 
A data.frame: 24 × 2
FieldDescription
<chr><chr>
Name The name of the GitHub repository
Description A brief textual description that summarizes the purpose or focus of the repository
URL The URL or web address that links to the GitHub repository
Created.At The date and time when the repository was initially created on GitHub
Updated.At The date and time of the most recent update or modification to the repository
Homepage The URL to the homepage or landing page associated with the repository
Size The size of the repository in bytes, indicating the total storage space used by the repository's files and data
Stars The number of stars or likes that the repository has received from other GitHub users, indicating its popularity or interest
Forks The number of times the repository has been forked by other GitHub users
Issues The total number of open issues
Watchers The number of GitHub users who are 'watching' or monitoring the repository for updates and changes
Language The primary programming language
License Information about the software license using a license identifier
Topics A list of topics or tags associated with the repository, helping users discover related projects and topics of interest
Has.Issues A boolean value indicating whether the repository has an issue tracker enabled
Has.Projects A boolean value indicating whether the repository uses GitHub Projects to manage and organize tasks and work items
Has.Downloads A boolean value indicating whether the repository offers downloadable files or assets to users
Has.Wiki A boolean value indicating whether the repository has an associated wiki with additional documentation and information
Has.Pages A boolean value indicating whether the repository has GitHub Pages enabled, allowing the creation of a website associated with the repository
Has.DiscussionsA boolean value indicating whether the repository has GitHub Discussions enabled, allowing community discussions and collaboration
Is.Fork A boolean value indicating whether the repository is a fork of another repository
Is.Archived A boolean value indicating whether the repository is archived. Archived repositories are typically read-only and are no longer actively maintained
Is.Template A boolean value indicating whether the repository is set up as a template
Default.Branch The name of the default branch

The selected dataset includes variables indicating the basic characteristics of the repository. In order to explore the research question, we plan to use the number of stars—which shows the popularity of a repository—as the response variable and use different input variables including number of forks, number of issues, the size of the repository (in KB), whether Discussions, Wiki, Pages and Projects are enabled and whether repository is set up as a template to find a model with good prediction power. The reason why we chose these input variables is because previous research with our selected dataset shows that some variables have little correlation with the popularity of a repository. We have decided to ignore said variables to cut computation times.


The dataset is very large, containing over 215,000 rows or observations (each corresponding to a repository), so we chose to take a stratified random sample of size 1,000 to use for all further data analysis, visualization and modelling.

Methods and Results¶

a) Exploratory Data Analysis¶

In [2]:
# Contributors: Ellie, Leen, Macy, Yuxi
library(broom)
library(repr)
library(infer)
library(gridExtra)
library(faraway)
library(mltools)
library(leaps)
library(glmnet)
library(cowplot)
library(modelr)
library(tidyverse)
library(dplyr)

install.packages("gridExtra")
Loading required package: Matrix

Loaded glmnet 4.1-8


Attaching package: ‘modelr’


The following objects are masked from ‘package:mltools’:

    mse, rmse


The following object is masked from ‘package:broom’:

    bootstrap


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ modelr::bootstrap() masks broom::bootstrap()
✖ dplyr::combine()    masks gridExtra::combine()
✖ tidyr::expand()     masks Matrix::expand()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::lag()        masks stats::lag()
✖ modelr::mse()       masks mltools::mse()
✖ tidyr::pack()       masks Matrix::pack()
✖ tidyr::replace_na() masks mltools::replace_na()
✖ modelr::rmse()      masks mltools::rmse()
✖ lubridate::stamp()  masks cowplot::stamp()
✖ tidyr::unpack()     masks Matrix::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

We first read in the dataset and selecting the variables we will be using for our analysis. We then take a stratified random sample of the data based on number of Stars ("high" or "low").

In [3]:
# Main developer: Ellie
set.seed(8035)
library(readr)
sample_size = 500
# Read in the dataset
repo <- read_csv("./repositories.csv")
# Remove unused columns, take stratified random sample grouped by Stars >= median and Stars < median
stars_med = median(repo$Stars)
repo_strat_sample <- repo %>%
    select(-Name, -Homepage, -Description, -Watchers, -URL, -'Created At', -'Updated At', -Language, -License, -Topics, -'Default Branch', -'Is Archived', -'Is Fork', -'Has Issues', -'Has Downloads') %>%
    mutate(no_stars = ifelse(Stars >= stars_med, "high", "low")) %>%
    group_by(no_stars) %>%
    sample_n(size = sample_size, replace = FALSE) %>%
    ungroup() %>%
    select(-no_stars)
cat("Table 2: Stratified Sample of Repositories \n")
head(repo_strat_sample)
Rows: 215029 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): Name, Description, URL, Homepage, Language, License, Topics, Defau...
dbl  (5): Size, Stars, Forks, Issues, Watchers
lgl  (9): Has Issues, Has Projects, Has Downloads, Has Wiki, Has Pages, Has ...
dttm (2): Created At, Updated At

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Table 2: Stratified Sample of Repositories 
A tibble: 6 × 9
SizeStarsForksIssuesHas ProjectsHas WikiHas PagesHas DiscussionsIs Template
<dbl><dbl><dbl><dbl><lgl><lgl><lgl><lgl><lgl>
28512 413119 3 TRUE TRUEFALSEFALSEFALSE
3627 952145132 TRUE TRUEFALSEFALSEFALSE
95861175167 29 TRUE TRUEFALSEFALSEFALSE
1248 848122 40 TRUE TRUEFALSEFALSEFALSE
64801283409 34 TRUE TRUE TRUEFALSEFALSE
3556 578104 2FALSEFALSEFALSEFALSEFALSE
In [4]:
cat("Table 3: Summary Statistics of the sample data\n")
summary(repo$Stars)
summary(repo$Forks)
summary(repo$Issues)
Table 3: Summary Statistics of the sample data
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    167     237     377    1115     797  374074 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     0.0     39.0     79.0    234.2    174.0 243339.0 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.00     3.00    10.00    37.92    28.00 26543.00 

All the selected continuous variables in our dataset had right-skewed distributions. This skewness, as well as the extremely large ranges and the presence of extreme outliers, could potentially distort statistical analyses and reduce the interpretability of our data. This indicated the need for a transformation to address these challenges. Applying a log transformation seemed appropriate as it helps stabilize the variance, making the data more homoscedastic, and lessens the impact of extreme values. This approach makes the modeling process more reliable and helps us better understand the relationships between variables, especially when making predictions.

In [5]:
# Main developer: Leen
# Contributor: YUXI
options(repr.plot.width = 12, repr.plot.height =5) 
p1 <- ggplot(repo, aes(x = log(Size))) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  ggtitle("Histogram of log(Size)") +
  xlab("log(Size)") +
  ylab("Frequency") +
  theme_minimal() +
theme(
    plot.title = element_text(size = 16, face = "bold"),  
    axis.title.x = element_text(size = 14),  
    axis.title.y = element_text(size = 14), 
    axis.text.x = element_text(size = 12),  
    axis.text.y = element_text(size = 12)    
  )

p1
cat("Figure 1: The histogram of the log(Size) ")
Warning message:
“Removed 151 rows containing non-finite outside the scale range (`stat_bin()`).”
Figure 1: The histogram of the log(Size) 
No description has been provided for this image

This is a histogram of one continuous variable Size after applying a log transformation to it. This is now much less skewed.

In [6]:
# Main developer: Leen
# Contributor: YUXI
#plot 2: scatterplot of stars vs forks, point size represents repository size and is colored for has_discussions
p2 <- repo_strat_sample %>%
    ggplot(aes(x = Forks, y = Stars, color = `Has Discussions`, size = Size)) +  
    geom_point(alpha = 0.6) +  # semi transparent points to make it easier to visualize even when overlapping
    scale_size(range = c(1, 10)) +  # adjusting size of the points based on size of repository
    scale_color_manual(values = c("red", "blue")) + 
    labs(title = "Combo Scatterplot: Forks, Stars, and Repository Size (Colored by Has Discussions)",
         x = "(Log) Forks",
         y = "(Log) Stars",
         size = "Repository Size",
         color = "Has Discussions") +
    scale_x_log10() +
    scale_y_log10() +
    theme_minimal()+
theme(
    plot.title = element_text(size = 16, face = "bold"),  
    axis.title.x = element_text(size = 14),  
    axis.title.y = element_text(size = 14), 
    axis.text.x = element_text(size = 12),  
    axis.text.y = element_text(size = 12)    
  )

p2
cat("Figure 2: A summarized scatterplot of log(Stars) vs log(Forks) ")
Figure 2: A summarized scatterplot of log(Stars) vs log(Forks) 
No description has been provided for this image

For this combination scatterplot, there seems to be a positive association between (logged) Forks and (logged) Stars, with repositories having discussions enabled (blue) being more common among those with higher star and fork counts. When examining the sizes of the data points, there doesn’t appear to be a clear pattern based on Star levels, suggesting a very weak or nonexistent relationship between (logged) Stars and repository Size.

In [7]:
# Main developer: Leen
# Contributor: YUXI
repositories <- repo %>%
  mutate(across(c(`Has Discussions`, `Has Wiki`, `Is Template`), as.factor))

p3 <- repositories %>%
  pivot_longer(cols = c(`Has Discussions`, `Has Wiki`, `Is Template`),
               names_to = "Variable",
               values_to = "Value") %>%
  ggplot(aes(x = Value, y = log(Stars), fill = Variable)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ Variable, ncol = 4) +
  labs(title = "Boxplots of Stars by Repository Features",
       x = "Feature Value",
       y = "Log(Stars)") +
  theme_minimal() +
  theme(legend.position = "none")+
theme(
    plot.title = element_text(size = 16, face = "bold"),  
    axis.title.x = element_text(size = 14),  
    axis.title.y = element_text(size = 14), 
    axis.text.x = element_text(size = 12),  
    axis.text.y = element_text(size = 12)    
  )

p3
cat("Figure 3: Boxplots Comparing Log(Size) by Repository Features (Has Discussions, Has Wiki, and Is Template) ")
Figure 3: Boxplots Comparing Log(Size) by Repository Features (Has Discussions, Has Wiki, and Is Template) 
No description has been provided for this image

For the Has_projects boxplot, it appears that repositories without projects tend to have a higher average number of stars. Similarly, the Has_wiki boxplot suggests that repositories without an associated wiki tend to have a higher average number of stars. Lastly, for the is_template boxplot, the number of stars does not vary significantly based on whether a repository is a template, although the average is slightly higher for repositories that are not templates. Overall, none of these differences seem substantial at first glance, which might indicate a lack of a strong relationship between the number of stars and these variables.

In [8]:
# Main developer: Ellie
# Log-transform Stars, Size, Forks, Issues columns, and set 'Has Discussions' as factor
repo_sample_log <- repo_strat_sample %>%
    mutate(Size = log(Size + 1),
          Stars = log(Stars + 1),
           Forks = log(Forks + 1),
           Issues = log(Issues + 1),
          `Has Discussions` = as.factor(`Has Discussions`))
cat("Table 4: The sample data when Log-transform Stars, Size, Forks, Issues columns, and set 'Has Discussions' as factor")
head(repo_sample_log)
Table 4: The sample data when Log-transform Stars, Size, Forks, Issues columns, and set 'Has Discussions' as factor
A tibble: 6 × 9
SizeStarsForksIssuesHas ProjectsHas WikiHas PagesHas DiscussionsIs Template
<dbl><dbl><dbl><dbl><lgl><lgl><lgl><fct><lgl>
10.2581156.0258664.7874921.386294 TRUE TRUEFALSEFALSEFALSE
8.1964376.8596154.9836074.890349 TRUE TRUEFALSEFALSEFALSE
9.1681637.0698745.1239643.401197 TRUE TRUEFALSEFALSEFALSE
7.1300996.7440594.8121843.713572 TRUE TRUEFALSEFALSEFALSE
8.7766307.1577356.0161573.555348 TRUE TRUE TRUEFALSEFALSE
8.1766736.3613024.6539601.098612FALSEFALSEFALSEFALSEFALSE

Now, lets take a closer look at our response variable Stars after the log transformation for both the sample and the full dataset. With this we can see if our range of values in the sample is representative of the full dataset.

In [9]:
# Main developer: Macy
# exploration of Stars column in sample
Stars_summary <- as.data.frame(as.list(summary(repo_sample_log$Stars)))
repo_Stars_exploration <- tibble(
    Data = "Sample",
    Min = Stars_summary$Min.,
    Median = Stars_summary$Median,
    Mean = Stars_summary$Mean,
    Max = Stars_summary$Max.,
    SD = sd(repo_sample_log$Stars))

Stars_summary <- as.data.frame(as.list(summary(log(repo$Stars + 1))))
repo_Stars_exploration <- rbind(
        model = repo_Stars_exploration,
        data = tibble(
    Data = "Full Dataset",
    Min = Stars_summary$Min.,
    Median = Stars_summary$Median,
    Mean = Stars_summary$Mean,
    Max = Stars_summary$Max.,
    SD = sd(repo_sample_log$Stars)))
cat("Table 5: Summary Statistics for Sample and Full Dataset")
repo_Stars_exploration
Table 5: Summary Statistics for Sample and Full Dataset
A tibble: 2 × 6
DataMinMedianMeanMaxSD
<chr><dbl><dbl><dbl><dbl><dbl>
1Sample 5.1239645.9335706.23098411.573941.027779
2Full Dataset5.1239645.9348946.21828212.832211.027779

We can see that both of these summaries are very similar, barring the slight difference in our max value. This indicates that our stratified sample is representative of our full dataset.

Next, we wil create a correlation matrix on a heatmap to check for multicollinearity or linear dependence among our covariates.

In [10]:
# Main developer: Ellie
options(repr.plot.width = 10, repr.plot.height = 8) 
repo_sample_log_bin <- 
    repo_sample_log %>%
    mutate(across(where(is.logical), as.numeric)) %>%
    mutate(`Has Discussions` = as.numeric(`Has Discussions`))
corr_matrix_repo <-
    repo_sample_log_bin %>%
    select(-Stars) %>%
    cor() %>%
    as_tibble(rownames = 'var1') %>%
    pivot_longer(-var1, names_to = "var2", values_to = "corr")
plot_corr_matrix_repo <-
    corr_matrix_repo %>%
    ggplot(mapping = aes(var1, var2)) +
    geom_tile(mapping = aes(fill = corr), color = "white") +
    scale_fill_distiller("Correlation Coefficient \n",
      palette =  "YlOrRd",
      direction = 1, 
      limits = c(-1, 1)
    ) +
    labs(x = "", y = "") +
    theme_minimal() +
    theme(
        axis.text.x = element_text(angle = 45, vjust = 1, size = 14, hjust = 1),
        axis.text.y = element_text(vjust = 1, size = 14, hjust = 1),
        legend.title = element_text(size = 18),
        legend.text = element_text(size = 12),
        legend.key.size = unit(1.5, "cm")
    ) +
    coord_fixed() +
    geom_text(aes(var1, var2, label = round(corr, 2)), color = "black", size = 6)
cat("Figure 4: Correlation Matrix of Repository Features and Metrics ")
plot_corr_matrix_repo
Figure 4: Correlation Matrix of Repository Features and Metrics 
No description has been provided for this image

Note that the largest correlation, in magnitude, is between Has.Wiki and Has.Projects, with a coefficient of 0.57. This is noticeable but not large enough to be cause for concern or to consider dropping either variable.

b) Methods: Plan¶

Overview of model selection method
Among our group members we used forward selection and Lasso as methods for model selection. Between these two methods, forward-selection resulted in a better predictive model, as assessed by Root Mean Squared Error (RMSE), so we chose to use this method in our final report. This approach begins with the intercept-only model and adds covariates one at a time, selecting the best model of each size (i.e. number of covariates) based on the lowest residual sum of squares (RSS). Based on these results we will select the model with the lowest $C_p$ value, which identifies the best predictive model regarding goodness of fit. To evaluate its performance, we will compare this reduced model against a full model (regressing Stars on all 8 covariates) using RMSE to analyze the predictive power of the selected covariates and address our research question.

Forward selection is appropriate for this analysis because it is computationally efficient, simple to implement, and adheres to the principle of parsimony, balancing predictive power and interpretability. This aligns with our goal of identifying the subset of covariates that will be most helpful in predicting Stars while maintaining interpretability.

However, forward selection has some limitations. It may not find the globally optimal model, especially if certain variables improve predictive power only when included together. Additionally, it focuses on minimizing in-sample error, which may not ensure the model generalizes well to new data. Comparing the RMSE of the reduced and full models will help address this concern.

By applying forward selection and evaluating the resulting model, we aim to identify a reduced model with strong predictive power and interpretability, effectively addressing our research question.


Addressing key assumptions:
(a) No severe multicollinearity
Earlier we checked for multicollinearity by creating a correlation matrix to identify significant linear associations or dependencies between covariates. We addressed these already.

(b) Approximately linear relationship
We assume a roughly linear relationship between response and covariates. Preliminary analysis suggests a fairly linear relationship between Stars and Forks, supporting this assumption.

(c) No ommitted key covariates
All variables excluded from the original dataset were intentionally removed in confidence that they are not relevant predictors for Stars.

(d) Independence of observations
We assume each repository is distinct and independent from one another.


Implementation of proposed model

We first split our data into a training and testing set, and reorder the columns so Stars is first.

In [11]:
# Main developer: Ellie
# Contributor: YUXI
set.seed(8035)
repo_sample_log <- 
    repo_sample_log %>%
    mutate(id = row_number()) %>%
    select(Stars, everything())

training_repo <-
    repo_sample_log %>%
    slice_sample(prop = 0.7, replace = FALSE)

testing_repo <-
    repo_sample_log %>%
    anti_join(training_repo, by = "id") %>%
    select(-id)

training_repo <- 
    training_repo %>%
    select(-id)
cat("Table 6: The training dataset")
head(training_repo)
Table 6: The training dataset
A tibble: 6 × 9
StarsSizeForksIssuesHas ProjectsHas WikiHas PagesHas DiscussionsIs Template
<dbl><dbl><dbl><dbl><lgl><lgl><lgl><fct><lgl>
6.93925410.1893055.3518583.850148 TRUE TRUE TRUETRUE FALSE
5.545177 7.9247962.8332133.433987 TRUE TRUEFALSEFALSEFALSE
6.401917 8.0677765.1929573.258097 TRUE TRUEFALSEFALSEFALSE
6.342121 3.4965084.6728291.386294 TRUE TRUEFALSEFALSEFALSE
7.65112012.5608027.1989311.945910 TRUEFALSEFALSEFALSEFALSE
5.241747 8.4364174.4659083.496508FALSEFALSEFALSEFALSEFALSE

We fit the full model including all covariates. We will use this as a baseline for later comparison.

In [12]:
# Main developer: Ellie
repo_full <- lm(Stars ~., training_repo)
summary(repo_full)
Call:
lm(formula = Stars ~ ., data = training_repo)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.93061 -0.46914 -0.05547  0.37084  2.40730 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            3.72549    0.13706  27.182  < 2e-16 ***
Size                  -0.01602    0.01111  -1.442 0.149675    
Forks                  0.53272    0.02358  22.593  < 2e-16 ***
Issues                 0.07921    0.01924   4.116 4.32e-05 ***
`Has Projects`TRUE     0.02845    0.09029   0.315 0.752780    
`Has Wiki`TRUE        -0.05800    0.07934  -0.731 0.464994    
`Has Pages`TRUE        0.06466    0.07141   0.906 0.365493    
`Has Discussions`TRUE  0.29440    0.07918   3.718 0.000217 ***
`Is Template`TRUE      0.04688    0.30825   0.152 0.879166    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6748 on 691 degrees of freedom
Multiple R-squared:  0.5301,	Adjusted R-squared:  0.5247 
F-statistic: 97.44 on 8 and 691 DF,  p-value: < 2.2e-16

We now use this full model to obtain out-of-sample predictions for repos in the testing set.

In [13]:
# Main developer: Ellie
# Contributor: YUXI
repo_test_pred_full <- predict(repo_full, newdata = testing_repo)
predicted_values <- data.frame(
  Index = 1:length(repo_test_pred_full),
  Predicted_Value = repo_test_pred_full
)
cat("Table 7: Predicted Values from the Full Model on the Testing Dataset")
head(predicted_values)
Table 7: Predicted Values from the Full Model on the Testing Dataset
A data.frame: 6 × 2
IndexPredicted_Value
<int><dbl>
116.138744
227.196477
335.491288
447.342788
556.373880
666.904508

Compute the Root Mean Squared Error (RMSE) and adjusted $R^2$ in order to evaluate the full predictive model.

In [14]:
# Main developer: Ellie
# Contributor: Macy
repo_RMSE_models <- tibble(
    Model = "Full Regression",
    RMSE = rmse(model = repo_full,
                data = testing_repo),
    Adj.R2 = summary(repo_full)$adj.r.squared)
cat("Table 8: Performance Metrics for the Full Regression Model")
repo_RMSE_models
Table 8: Performance Metrics for the Full Regression Model
A tibble: 1 × 3
ModelRMSEAdj.R2
<chr><dbl><dbl>
Full Regression0.70317730.5246514

Use forward selection to select a reduced linear regression model.

In [15]:
# Main developer: Ellie
repo_forward_selection <- leaps::regsubsets(
    x = Stars ~., nvmax = 8,
    data = training_repo,
    method = "forward")

repo_forward_summary <- summary(repo_forward_selection)
repo_forward_summary <- tibble(
    n_input_variables = 1:8,
    RSS = repo_forward_summary$rss,
    BIC = repo_forward_summary$bic,
    Cp = repo_forward_summary$cp)
summary(repo_forward_selection)
cat("Table 9: Model Selection Metrics for Different Numbers of Input Variables")
repo_forward_summary
Subset selection object
Call: regsubsets.formula(x = Stars ~ ., nvmax = 8, data = training_repo, 
    method = "forward")
8 Variables  (and intercept)
                      Forced in Forced out
Size                      FALSE      FALSE
Forks                     FALSE      FALSE
Issues                    FALSE      FALSE
`Has Projects`TRUE        FALSE      FALSE
`Has Wiki`TRUE            FALSE      FALSE
`Has Pages`TRUE           FALSE      FALSE
`Has Discussions`TRUE     FALSE      FALSE
`Is Template`TRUE         FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: forward
         Size Forks Issues `Has Projects`TRUE `Has Wiki`TRUE `Has Pages`TRUE
1  ( 1 ) " "  "*"   " "    " "                " "            " "            
2  ( 1 ) " "  "*"   "*"    " "                " "            " "            
3  ( 1 ) " "  "*"   "*"    " "                " "            " "            
4  ( 1 ) "*"  "*"   "*"    " "                " "            " "            
5  ( 1 ) "*"  "*"   "*"    " "                " "            "*"            
6  ( 1 ) "*"  "*"   "*"    " "                "*"            "*"            
7  ( 1 ) "*"  "*"   "*"    "*"                "*"            "*"            
8  ( 1 ) "*"  "*"   "*"    "*"                "*"            "*"            
         `Has Discussions`TRUE `Is Template`TRUE
1  ( 1 ) " "                   " "              
2  ( 1 ) " "                   " "              
3  ( 1 ) "*"                   " "              
4  ( 1 ) "*"                   " "              
5  ( 1 ) "*"                   " "              
6  ( 1 ) "*"                   " "              
7  ( 1 ) "*"                   " "              
8  ( 1 ) "*"                   "*"              
Table 9: Model Selection Metrics for Different Numbers of Input Variables
A tibble: 8 × 4
n_input_variablesRSSBICCp
<int><dbl><dbl><dbl>
1333.7928-474.214937.032810
2322.0809-492.666313.312553
3316.1614-499.1000 2.312994
4315.2965-494.4665 2.413627
5314.9206-488.7505 3.588071
6314.7055-482.6778 5.115642
7314.6634-476.2204 7.023129
8314.6528-469.6928 9.000000

Visualize the Mallow's Cp statistics of these models.

In [16]:
# Main developer: Ellie
options(repr.plot.width = 8, repr.plot.height = 6) 
plot(summary(repo_forward_selection)$cp,
     main = "Cp for forward selection",
     xlab = "Number of Input Variables", 
     ylab = "Cp",
     type = "b",
     pch = 19,
     col = "magenta"
)
cat("Figure 5: Cp Values for Forward Selection Across Different Numbers of Input Variables ")
Figure 5: Cp Values for Forward Selection Across Different Numbers of Input Variables 
No description has been provided for this image

We select the model with the lowest Cp value.

In [17]:
# Main developer: Ellie
repo_selection_summary <- summary(repo_forward_selection)
cp_min = which.min(repo_selection_summary$cp)
variables = repo_selection_summary$which[cp_min, ]
repo_reduced = lm(Stars ~ ., data = training_repo[variables])
summary(repo_reduced)
Call:
lm(formula = Stars ~ ., data = training_repo[variables])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.96497 -0.47861 -0.06419  0.38315  2.48105 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            3.60834    0.09807  36.795  < 2e-16 ***
Forks                  0.52875    0.02277  23.223  < 2e-16 ***
Issues                 0.07846    0.01899   4.131 4.04e-05 ***
`Has Discussions`TRUE  0.28259    0.07828   3.610 0.000328 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.674 on 696 degrees of freedom
Multiple R-squared:  0.5278,	Adjusted R-squared:  0.5258 
F-statistic: 259.4 on 3 and 696 DF,  p-value: < 2.2e-16

Compare the RMSE and adjusted $R^2$ of our selected reduced model with that of the full (baseline) model.

In [18]:
# Main developer: Ellie
# Contributor: Macy
repo_RMSE_models_compare <-
    rbind(
        model = repo_RMSE_models,
        data = tibble(Model = "Reduced Regression",
                      RMSE = rmse(model = repo_reduced,
                                  data = testing_repo),
                     Adj.R2 = summary(repo_reduced)$adj.r.squared))
cat("Table 10:  Comparison of Model Performance Metrics for Full and Reduced Regression Models")
repo_RMSE_models_compare
# Print names of selected covariates
selected_var <- names(coef(repo_forward_selection, cp_min))[-1]
selected_var
Table 10:  Comparison of Model Performance Metrics for Full and Reduced Regression Models
A tibble: 2 × 3
ModelRMSEAdj.R2
<chr><dbl><dbl>
1Full Regression 0.70317730.5246514
2Reduced Regression0.70061820.5258036
  1. 'Forks'
  2. 'Issues'
  3. '`Has Discussions`TRUE'

Interpretation of results
The table above shows that the Root Mean Squared Error of our reduced model, selected using forward selection and containing 3 covariates, is approximately the same as that of the full model containing all 8 covariates. This suggests that our reduced model has similar prediction performance compared with the full model. The Adjusted $R^2$ is also extreamly similar between the two models at apporximately 52.5%. This similarity between our full and reduced models suggests that only two or three of the variables used are responsible for most of our predictive power.

This is supported by the looking at our previous summaries for the full and reduced models where only the Intercept, Forks, Issues and Has.Discussions have $p$ values that are statistcally significant.

Discussion¶

In the previous sections we computed and analysed our reduced model obtained via forward selection and found that it utilizes 3 out of the 8 variables we included in our analysis. This reduced model and our full model both had Root Mean Squared Error's of approximately 0.7 and extremely similar $R^2$ of 52.5%. This has some implications for research questions, which are as follows:


Research questions:

  1. Which fundamental characteristics of a repository influence its popularity?
  2. Can these fundamental characteristics effectively predict the popularity of a repository?

Research Question 1)

In regards to the first question, we noticed during our interpretation of results that in both the full and reduced models only Forks, Issues and Has.Discussions are ever statistically significant. This implies that most of our predictive power originates from these three variables, and that the other variables have a minimal effect on our predictive power. This is supported by the small difference in reduced and full model RMSE's and $R^2$. In this way, we can tentatively conclude that Forks, Issues and Has.Discussions are fundamental characteristics of a repository that can influence its popularity.

This conclusion is further supported by existing literature (Hudson Borges, 2016; Junxiao Han, 2019), both of which found that Forks exhibits a strong positive correlation to Stars.

Our conclusion for this research question is about what we expected after looking into the existing research, especially in terms of Forks. The significance of Has.Discussions and Issues were outside our expectations, but aren't isn't surprising.


Research Question 2)

As for our second research question, it depends on if we are aiming to predict Stars accurately or if we are looking for predictions of general trends in repository popularity. Recall from our Exploratory Data Analysis that our values in the sample we use vary between a minimum of 5.124 and a maximum of 11.574. This is a range of 6.45. We can use this to normalize our RMSE, which results in an approximate 10.9% average error relative to our range.

This is not a particularly flattering assessment of our model's accuracy. If the intention of use behind this model is to accurately predict Stars, then this model doesn't effectively predict the popularity of a repository using fundamental characteristics. On the other hand, if our intention is to use this model to predict general trends in repository popularity, it is sufficient.

Keeping in mind that just because our model isn’t accurate doesn’t mean that a more sophisticated model won’t be accurate, we can't come to a conclusive answer for our second research question. While this isn’t what we hoped to find, it isn’t very surprising either.


Improvements

On that note, there are some improvements we could consider for the future. One is fine-tuning how we transform continuous variables via logs to achieve more linear relationships. While the logarithmic transformations we used are common for this purpose, they don’t always guarantee perfectly linear data. Further reports could explore using different transformations.

Another improvement we could make is to increase the amount of variables we use in our full model before performing forward selection.


Future Questions

Some future question this report could lead to include:

  • Why are Forks, Issues and Has.Discussions strong predictors of Stars?
  • What other methods can be used to achieve accurate predictions of Stars using the fundamental characteristics of a repository?
  • Can a reduced model with only one or two variables achieve accurate predictions for Stars?

References¶

  1. Hudson Borges, Andre Hora, and Marco Tulio Valente. (2016). Predicting the Popularity of GitHub Repositories. ASERG Group, Department of Computer Science (DCC), Federal University of Minas Gerais (UFMG), Brazil. ⃝c 2016 ACM. ISBN 978-1-4503-2138-9.
  2. Junxiao Han, Shuiguang Deng, Xin Xia, Dongjing Wang, and Jianwei Yin. (2019). Characterization and Prediction of Popular Projects on GitHub. Proceedings of the 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). Zhejiang University, Monash University, and Hangzhou Dianzi University.